BinaryBERT: Pushing the Limit of BERT Quantization
TABLE 5.5  Quantization results of BinaryBERT on SQuAD and MNLI-m.

Method         #Bits (W-E-A)   Size (MB)   SQuAD v1.1 (EM/F1)   MNLI-m
------------   -------------   ---------   ------------------   ------
BERT-base      full-prec.      418         80.8/88.5            84.6
DistilBERT     full-prec.      250         79.1/86.9            81.6
LayerDrop-6L   full-prec.      328         -                    82.9
LayerDrop-3L   full-prec.      224         -                    78.6
TinyBERT-6L    full-prec.      55          79.7/87.5            82.8
ALBERT-E128    full-prec.      45          82.3/89.3            81.6
ALBERT-E768    full-prec.      120         81.5/88.6            82.0
Quant-Noise    PQ              38          -                    83.6
Q-BERT         2/4-8-8         53          79.9/87.5            83.5
Q-BERT         2/3-8-8         46          79.3/87.0            81.8
Q-BERT         2-8-8           28          69.7/79.6            76.6
GOBO           3-4-32          43          -                    83.7
GOBO           2-2-32          28          -                    71.0
TernaryBERT    2-2-8           28          79.9/87.4            83.5
BinaryBERT     1-1-8           17          80.8/88.3            84.2
BinaryBERT     1-1-4           17          79.3/87.2            83.9
Then, the prediction-layer distillation minimizes the soft cross-entropy (SCE) between the quantized student logits $\hat{y}$ and the teacher logits $y$, i.e.,
$$\ell_{\mathrm{pred}} = \mathrm{SCE}(\hat{y}, y). \qquad (5.25)$$
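As a concrete illustration, below is a minimal PyTorch sketch of the SCE loss in Eq. (5.25). The function name, tensor shapes, and batch averaging are illustrative assumptions; no distillation temperature is used, matching the plain form of Eq. (5.25).

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy: the teacher's softmax distribution supervises the
    student's log-probabilities, averaged over the batch."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Example: a batch of 8 examples with 3 classes (e.g., MNLI labels).
student_logits = torch.randn(8, 3, requires_grad=True)  # quantized student
teacher_logits = torch.randn(8, 3)                      # full-precision teacher
loss_pred = soft_cross_entropy(student_logits, teacher_logits)
loss_pred.backward()
```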
After splitting the half-sized ternary model, the binary model inherits the ternary model's performance, but now on a new, full-width architecture. However, the minimum reached by the ternary model may no longer be a minimum of the loss landscape of this new architecture after splitting. The authors therefore proposed to further fine-tune the binary model with prediction-layer distillation to search for a better solution.
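To make the splitting idea concrete, here is a simplified sketch that operates only on the quantized values: each ternary weight in {-α, 0, +α} is split into two binary weights in {-α/2, +α/2} whose sum recovers the original value, so the full-width binary model reproduces the half-width ternary model's output at initialization. This is an illustrative assumption rather than the paper's exact construction; the actual ternary weight splitting (TWS) operator also splits the latent full-precision weights so that training can continue smoothly.

```python
import torch

def split_ternary_to_binary(t: torch.Tensor, alpha: float):
    """Simplified splitting of quantized ternary weights into two binary halves.

    Each ternary value t in {-alpha, 0, +alpha} becomes two binary values
    b1, b2 in {-alpha/2, +alpha/2} with b1 + b2 == t, so the doubled-width
    binary layer reproduces the ternary layer's output at initialization.
    """
    beta = alpha / 2.0
    b1 = torch.where(t >= 0, torch.full_like(t, beta), torch.full_like(t, -beta))
    b2 = torch.where(t > 0, torch.full_like(t, beta), torch.full_like(t, -beta))
    # t = +alpha -> (+beta, +beta); t = 0 -> (+beta, -beta); t = -alpha -> (-beta, -beta)
    return b1, b2

# Sanity check: the two binary branches sum back to the ternary weights.
alpha = 0.7
t = alpha * torch.tensor([-1.0, 0.0, 1.0, 0.0, -1.0])
b1, b2 = split_ternary_to_binary(t, alpha)
assert torch.allclose(b1 + b2, t)
```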
For implementation, the authors took DynaBERT [89] sub-networks as backbones, offering both half-sized and full-sized models for easy comparison. First, a ternary model of width 0.5× is trained with the two-stage knowledge distillation until convergence. Then, the authors split it into a binary model of width 1.0× and performed further fine-tuning with prediction-layer distillation.
Table 5.5 compares their proposed BinaryBERT with a variety of state-of-the-art counterparts for quantizing BERT, including Q-BERT [208], GOBO [279], Quant-Noise [65], and TernaryBERT [285], on the MNLI task of GLUE [230] and on SQuAD v1.1 [198]. Aside from quantization, general compression approaches such as DistilBERT [206], LayerDrop [64], TinyBERT [106], and ALBERT [126] are also compared. BinaryBERT has the smallest model size and the best performance among all quantization approaches.
Compared with the full-precision model, BinaryBERT retains competitive performance with significantly reduced model size and computation. For example, it achieves a compression ratio of more than 24× compared with BERT-base, with only a 0.4% drop on MNLI-m and a 0.0%/0.2% drop on SQuAD v1.1, respectively.
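For concreteness, the numbers quoted above follow directly from Table 5.5 (sizes in MB, scores in %):

```python
# Worked example using the figures from Table 5.5.
bert_base_mb, binarybert_mb = 418, 17       # model sizes in MB
compression = bert_base_mb / binarybert_mb  # ~24.6x, i.e., "more than 24x"

mnli_drop = 84.6 - 84.2      # 0.4 points on MNLI-m
squad_em_drop = 80.8 - 80.8  # 0.0 points EM on SQuAD v1.1
squad_f1_drop = 88.5 - 88.3  # 0.2 points F1 on SQuAD v1.1

print(f"compression: {compression:.1f}x, "
      f"MNLI-m drop: {mnli_drop:.1f}, "
      f"SQuAD drop: {squad_em_drop:.1f}/{squad_f1_drop:.1f}")
```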
In summary, the contributions of this paper are twofold: (1) it is the first work to explore BERT binarization, together with an analysis of the performance drop of binarized BERT models; and (2) it proposes a ternary weight-splitting method that splits a trained ternary BERT to initialize BinaryBERT, which is then fine-tuned for further refinement.